17

Genomics

assigned to the genes or groups of genes whose transcription they control. Other

tasks include the identification of those genes (in humans, mammals, etc.) believed

to originate from viruses and the localization of hypervariable regions (e.g., those

coding for immunoglobulins). Ultimately, the aim is to be able to understand the

relationships among the various elements of the genome.

Gene prediction can be divided into intrinsic (template) and extrinsic (lookup)

methods. The former are the best candidates for leading to fundamental insight into

how the gene works; if they are successful, they should, furthermore, inevitably

provide the means to generalize from the biochemistry of natural sequences to yield

rules for designing new genes (and genomes) to fulfil specified functions. We shall

begin, however, by considering the conceptually simpler extrinsic methods.

17.4 Extrinsic Methods

The principle of the extrinsic or lookup method is to identify a gene by finding a

sufficiently similar known object in existing databases. Hence, the method is based

on sequence similarity (to be discussed in Sect. 17.4.2), using the still relatively

small core of genes identified by classical genetic and molecular biological studies

to prime the comparison; that is, a gene of unknown function is compared with a

database of sequences of known function. This approach reflects a widely used,

but not necessarily correct (or genuinely useful), assumption that similar sequences

have similar functionalities.¹⁵ A major limitation of this approach is the fact that,

at present, about a third of the sequences of newly sequenced organisms turn out to

match no sufficiently similar known sequences in existing databanks. Furthermore,

errors in the sequences deposited in databases can pose a serious problem.
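The lookup principle can be illustrated with a minimal sketch. The similarity measure used here (Jaccard overlap of k-mer sets) is merely a simple stand-in chosen for brevity, not the measure developed in Sect. 17.4.2; the function names, the threshold value and the toy sequences are all hypothetical.

```python
from __future__ import annotations


def kmers(seq: str, k: int = 3) -> set[str]:
    """Return the set of overlapping substrings of length k in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Crude similarity: shared k-mers divided by all distinct k-mers."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def lookup(query: str, database: dict[str, str],
           threshold: float = 0.5, k: int = 3) -> tuple[str, float] | None:
    """Return (name, score) of the most similar known sequence,
    or None if nothing in the database is sufficiently similar."""
    best_name, best_score = None, 0.0
    for name, seq in database.items():
        score = jaccard(query, seq, k)
        if score > best_score:
            best_name, best_score = name, score
    if best_name is None or best_score < threshold:
        return None  # the "no sufficiently similar known sequence" case
    return best_name, best_score


# Hypothetical toy database of sequences with (assumed) known function.
known = {
    "geneA": "ATGGCCATTGTAATGGGCCGC",
    "geneB": "ATGTTTCCCGGGAAATTTCCC",
}
```

A query identical to a database entry is assigned that entry's label, whereas a query sharing no k-mers with any entry returns `None`, corresponding to the sequences that match nothing in existing databanks. Note that the sketch inherits both weaknesses mentioned above: it equates similarity with shared function, and it trusts every database entry to be error-free.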

17.4.1

Database Reliability

An inference, especially a deductive one, drawn from data is only as good as the data

from which it is formed. The question of the reliability of the data is certainly a matter

for legitimate concern. The most pernicious errors are wrong nucleic acid bases in

a sequence. The sources of such errors are legion and range from experimental

uncertainties to mistakes in typing the letters into a file using a keyboard. Of course,

these errors can be considered as a source of noise (i.e., equivocation) and handled

with the ideas developed earlier, especially in Chap. 7. Undoubtedly, there is a certain

redundancy in the sequences, but these questions of equivocation and redundancy in

15 Note that “homology” is defined as “similarity in structure of an organ or molecule, reflecting

a common evolutionary origin”. Sequence similarity is insufficient to establish homology, since

genomes contain both orthologous (related via common descent) and paralogous (resulting from

duplications within the genome) genes.